dipwmsearch: a Python package for searching di-PWM motifs

Abstract Motivation Seeking probabilistic motifs in a sequence is a common task to annotate putative transcription factor binding sites or other RNA/DNA binding sites. Useful motif representations include position weight matrices (PWMs), dinucleotide PWMs (di-PWMs), and hidden Markov models (HMMs). Dinucleotide PWMs not only combine the simplicity of PWMs—a matrix form and a cumulative scoring function—but also incorporate dependency between adjacent positions in the motif (unlike PWMs which disregard any dependency). For instance to represent binding sites, the HOCOMOCO database provides di-PWM motifs derived from experimental data. Currently, two programs, SPRy-SARUS and MOODS, can search for occurrences of di-PWMs in sequences. Results We propose a Python package called dipwmsearch, which provides an original and efficient algorithm for this task (it first enumerates matching words for the di-PWM, and then searches these all at once in the sequence, even if the latter contains IUPAC codes). The user benefits from an easy installation via Pypi or conda, a comprehensive documentation, and executable scripts that facilitate the use of di-PWMs. Availability and implementation dipwmsearch is available at https://pypi.org/project/dipwmsearch/ and https://gite.lirmm.fr/rivals/dipwmsearch/ under Cecill license.

1 Access to the packages, the source code and the documentation. We summarize some features between the tools SPRy-SARUS (url), MOODS [3], and dipwmsearch [5], in the table below.

Feature
SPRy-SARUS MOODS v3 dipwmsearch conda installation n y y pypi installation n y y IUPAC code (except N) blocking y y API programming library n y y output matching word n y y compressed sequence file n y y (via gzip/bzip python packages) multi FASTA y y y (example)

The LookAheadMatrix (LAM): definition and algorithm
The LAM is an additional matrix that contains look ahead scores. It is a generalisation of the LookAheadTable (proposed in [2]). We give a formal definition of the LAM below. The LAM is used in the Optimized Scanning algorithm, and in the enumeration step of both the enumeration based algorithm (FE) and the Core Enumeration algorithm (CE).
In this manuscript, we consider DNA/RNA sequences and assume the classical alphabet, denoted by Σ, is {A,C, G, T } or {A,C, G,U}. Sequences can also include symbols from the IUPAC DNA alphabet. The symbol σ denotes the alphabet size (e.g., in general σ = 4 in the case of DNA/RNA sequences). The algorithm is valid for any fixed alphabet Σ of size σ.
For DNA motif, the di-PWM matrices consider only the classical {A,C, G, T } alphabet, and hence in the following algorithm, we have σ = 4 (In other words, di-PWM matrices do not use non-ACGT symbols).
Algorithm 1: MakeLookAheadMatrix: compute the LookAheadMatrix of a di-PWM P. Both algorithms take as input a di-PWM, a sequence in which motif occurrences are sought, and a score threshold. The optimized scanning algorithm (OS) The optimized scanning algorithm (OS) follows a window scanning strategy: it scans the sequence from left to right and consider each possible window of length m. For each window, a naive algorithm would compute the score according to the di-PWM by summing the score of the current dinucleotide for each position in the window but the last. The OS algorithm iteratively computes the score of each prefix (from length 2 to m) of the window and check whether the maximal reachable score over the entire window is larger than the score threshold. If not, it stops computing scores for the current window, since the score of the current window cannot achieve the required threshold, and it continues with the next window. The maximal reachable score for a prefix of length j is obtained by summing the current prefix score with the LAM entry indexed by the last symbol of the prefix and j + 1.

The enumeration based algorithm for full di-PWM (FE).
This strategy is radically different from the window scanning one. We first enumerate all words of length m whose score is larger than the threshold t, which we call valid words, and then to search the set of valid words in T using an Aho-Corasick (AC) automaton [1]. Since there is an available and efficient implementation of the Aho-Corasick (AC) automaton in Python, the key lies in the enumeration algorithm, which is described in the section entitled "Enumeration of valid words: B&B approach and LAM" of the manuscript. The properties of the LAM induces the optimality of the Branch and Bound approach when exploring the trie of words in depth first search order: indeed, the score of full words is only entirely computed for valid words. This means that the algorithm stops computing the score of a prefix of a word (exploring the corresponding branch of trie) as soon one additional symbol makes the threshold score unreachable. The enumeration step outputs the list of valid words. A drawback is that the list can exhaust the available RAM on the computer. This occurs when the input di-PWM contains non selective columns. To avoid such a scenario, we designed the Core Optimized algorithm, which is explained in the main manuscript. 7 Protocol for comparing running times Currently, it is not possible to search for di-PWM using MOODS v3: MOODS cannot read the format for di-PWM matrices containing scores (an issue was sent on the github repository on Sep 22, 2022). This is why MOODS was not used in the comparisons.
Comparisons between dipwmsearch and SPRy-SARUS were performed as follows, using two bash scripts on a Linux system. Technical information are summarised below.
For both chromosome sequences (15 and 3), for all Human di-PWM motifs of HOCOMOCO database [4], for four ratios (0.8, 0.85, 0.9, 0.95), we recorded the running time of dipwmsearch and SPRy-SARUS using the time command. Each search for a di-PWM was performed in a distinct execution of the tools, implying that each time the target sequence and the di-PWM were loaded, the enumeration of valid words and then the scanning of the sequence performed, before storing the results in an output file on the disk. We did not take advantage of the possibility for dip to search for a multi FASTA file containing both sequences at once (which would have avoided to redo the enumeration step). For dipwmsearch, we used a python script that calls the procedure search_block_optimized for searching the di-PWM first on the Watson strand and then on the Crick strand. The time was measured using the time command from Python package time. For SPRy-SARUS, we recorded the time taken by program in user mode, using the /usr/bin/time command with option -f "%U". Both tools were run using a single thread; despite this, it occurs that the JAVA machine running SPRy-SARUS used more than 100% of the CPU.

Technical information
The

Variation of running time with respect to the HOCOMOCO di-PWM
The summary of results reported in the article present the running times in seconds for all HOCOMOCO di-PWM motifs. As mentioned above, the di-PWMs of HOCOMOCO vary in information content and thus in selectivity. This implies that the number of valid words strongly depends on the motif and the ratio, and hence, so do the enumeration and overall running times. The Figures 2 provides a view on the variability of the running times in function of the di-PWM and ratio, for both SPRy-SARUS (left plot) and dipwmsearch (right plot). We report the running times for searching on Human chromosomes 15 and 3. Note that because SPRy-SARUS can only process sequences containing A, C, G, T (but no other IUPAC codes), the sequence of chromosome 3 was slightly modified to replace undetermined positions by one standard nucleotide (before feeding SPRy-SARUS). For dipwmsearch, we used the normal chromosome 3 sequence with IUPAC codes. The variation of running times in comparison to the median running time is higher for dipwmsearch than for SPRy-SARUS. For dipwmsearch, it depends on the matrix and ratio, while SPRy-SARUS times depend mostly on the sequence length. However, the overal running times of dipwmsearch remain fast enough for practical uses.